Lecture 11: The delta method, quadratic regression, and polynomial regression

EBA3500 Data analysis with programming

Jonas Moss

BI Norwegian Business School
Department of Data Science and Analytics

Quadratic regression

Quadratic polynomial

The function \(f(x) = c + bx + ax^2\) is called a quadratic function or second-degree polynomial. It takes on four kinds of shapes, depending on the values of \(a\), \(b\), and \(c\).

Quadratic regression

  • Let the true data generating model be \[Y = \beta_0 + \beta_1X + \beta_2X^2 + \epsilon.\] This is a quadratic regression model. It can be regarded as a kind of non-linear regression model, but it usually isn’t treated as one.

  • Define \(X_1 = X\) and \(X_2 = X^2\). Then \[\beta_0 + \beta_1X + \beta_2X^2 = \beta_0 + \beta_1X_1 + \beta_2X_2,\] which implies that the quadratic regression model is actually a multiple linear regression model!

  • Why do we care? Because linear regression is much easier to compute than non-linear regression!
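A minimal sketch of this trick in code, on simulated data (the coefficients and sample size below are invented for illustration): the `I(x ** 2)` term makes patsy treat \(x^2\) as a second regressor, so the quadratic model is fitted with ordinary least squares.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate y = 1 + 2x - 3x^2 + noise; illustrative values only.
rng = np.random.default_rng(seed=313)
x = rng.uniform(-2, 2, 500)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 1, 500)

# I(x ** 2) adds x squared as a second regressor, so the quadratic
# model is estimated as a multiple linear regression.
fit = smf.ols("y ~ x + I(x ** 2)", data=pd.DataFrame({"x": x, "y": y})).fit()
print(fit.params)
```

The fitted coefficients should land close to the true values \(1\), \(2\), and \(-3\).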

Is this a good idea?

  1. Quadratic functions have a very specific shape.
  2. But few natural phenomena adhere to this shape – except if you know they do (physics!)
  3. Often used by social scientists when the data isn’t completely linear. And if the relationship between y and x isn’t linear, a quadratic regression will always fit better in terms of e.g. the \(R^2\).
  4. Sometimes relationships “flatten out”, and the quadratic curve will give a wrong impression.

An example

Five studies examined the relationship between talent and team performance. Two survey studies found that people believe there is a linear and nearly monotonic relationship between talent and performance: Participants expected that more talent improves performance and that this relationship never turns negative. However, building off research on status conflicts, we predicted that talent facilitates performance—but only up to a point, after which the benefits of more talent decrease and eventually become detrimental as intrateam coordination suffers. (Swaab et al., 2014)

So the authors claim there is no increasing relationship between talent and performance at the top level. That seems plausible considering e.g. Martin Ødegaard!

They did five studies, as is common in psychology, and we will look at one of the football studies. Have a look at the paper for more details.

url = "https://gist.githubusercontent.com/JonasMoss/ae5436bf951da5b3e723ce6fec39e77f/raw/03148a170130686d95f020b81e27bc14b35705ff/talent.csv"
talent = pd.read_csv(url)
sns.lmplot(x = "talent", y = "points", data = talent)

The data is evidently not linear. So let’s try the logarithmic transform.

talent["log_talent"] = np.log(talent["talent"] + 1)
sns.lmplot(x = "log_talent", y = "points", data = talent)

This looks quite linear!

import statsmodels.formula.api as smf
fit = smf.ols(formula = 'points ~ np.log(talent + 1)', data=talent).fit()
print(fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 points   R-squared:                       0.566
Model:                            OLS   Adj. R-squared:                  0.564
Method:                 Least Squares   F-statistic:                     268.8
Date:                Fri, 18 Nov 2022   Prob (F-statistic):           3.30e-39
Time:                        07:38:41   Log-Likelihood:                -1408.4
No. Observations:                 208   AIC:                             2821.
Df Residuals:                     206   BIC:                             2828.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept            273.8867     16.311     16.791      0.000     241.729     306.045
np.log(talent + 1)   247.0227     15.068     16.394      0.000     217.315     276.730
==============================================================================
Omnibus:                       25.340   Durbin-Watson:                   0.952
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               30.640
Skew:                           0.874   Prob(JB):                     2.22e-07
Kurtosis:                       3.692   Cond. No.                         1.60
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


The fitted quadratic function is “U-shaped” (an inverted U): points first increase in talent, then decrease. At least when you look at the fitted curve!

fit = smf.ols(formula = 'points ~ talent + I(talent ** 2)', data=talent).fit()
print(fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 points   R-squared:                       0.464
Model:                            OLS   Adj. R-squared:                  0.459
Method:                 Least Squares   F-statistic:                     88.87
Date:                Fri, 18 Nov 2022   Prob (F-statistic):           1.61e-28
Time:                        07:38:41   Log-Likelihood:                -1430.3
No. Observations:                 208   AIC:                             2867.
Df Residuals:                     205   BIC:                             2877.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept        305.3440     17.627     17.323      0.000     270.591     340.097
talent            54.8979      5.469     10.039      0.000      44.116      65.680
I(talent ** 2)    -0.5702      0.075     -7.604      0.000      -0.718      -0.422
==============================================================================
Omnibus:                       18.548   Durbin-Watson:                   0.654
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               21.285
Skew:                           0.781   Prob(JB):                     2.39e-05
Kurtosis:                       3.128   Cond. No.                         988.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
  • Not surprisingly, the quadratic model has a smaller \(R^2\) (\(0.464\)) than the log-linear model (\(0.566\)). Along with just looking at the data, this provides evidence that there is no “U-shape” between talent and performance, but an increasing relationship. Just as you’d expect!

Moral of the story

  1. Always look at the data. If it doesn’t look like it supports your hypothesis, it probably does not.
  2. Try out different models; some might fit much better than the others.
  3. Do not blindly trust the qualitative consequences of models. Even if the quadratic model had a better fit than the log-model, it wouldn’t provide strong evidence for a U-shape – for a quadratic curve must be U-shaped!
  4. Even when \(X\) is univariate, it’s often hard to find regression functions that fit the data better than a log-transform or just an ordinary linear regression.
  5. But quadratic regressions are fine if you only want to predict stuff!

Modelling complicated univariate functions

How can you model non-linear relationships? Not all functions are closely approximated by linear functions, e.g., np.sin(2 * np.pi * x) * x * np.exp(x).

Polynomial regression

  • Stone–Weierstrass theorem. Any continuous function on a closed interval can be approximated uniformly by polynomials.
  • A \(k\)th degree polynomial is \(f(x)=a_{0}+a_{1}x+a_{2}x^{2}+\ldots+a_{k}x^{k}\).
  • Idea: Fit polynomials to the data and select the order based on e.g. AIC.

Doing this in practice

  • You can use the np.vander function to construct design matrices consisting of polynomials.
def fit_model(deg):
  # np.vander with increasing = True includes a constant column, so we
  # remove patsy's intercept with "- 1". Note that deg columns correspond
  # to a polynomial of degree deg - 1.
  formula = f"y ~ np.vander(x, {deg}, increasing = True) - 1"
  return smf.ols(formula, data = pd.DataFrame({"x": x, "y": y})).fit()
  • Then we can fit \(k\)th degree polynomials!
[(deg, fit_model(deg).aic) for deg in range(5, 16)]
[(5, 770.3176318449504),
 (6, 765.6739087918303),
 (7, 722.3531744500278),
 (8, 631.4232378065403),
 (9, 611.365210202308),
 (10, 570.3460457476948),
 (11, 516.8364848836604),
 (12, 518.5146827024718),
 (13, 516.993964244756),
 (14, 518.9040946323105),
 (15, 520.2862547758479)]

Plotting this

Regression splines

  • Another option is splines.

  • They are piecewise polynomials glued together.

  • Just like polynomials, any continuous function can be approximated by splines if the degrees of freedom are high enough!

  • You can create spline bases using the bs function from patsy, e.g., y ~ bs(x, df = 5, degree = 3). The degree corresponds to the degree of the polynomial, three being the common option.

  • Details are not on the curriculum – but you should know that splines > polynomials for modelling arbitrary shapes!

  • Take a look at this video for more information.
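As a small sketch of what bs produces (assuming patsy is installed): dmatrix evaluates a formula into a design matrix, so we can inspect the spline basis directly.

```python
import numpy as np
from patsy import dmatrix

# Build a cubic spline basis with 5 degrees of freedom on 100 points.
x = np.linspace(0, 3, 100)
basis = dmatrix("bs(x, df = 5, degree = 3)", {"x": x})

# df = 5 yields five spline columns; patsy adds an intercept column.
print(basis.shape)
```

Each row is one observation, and each column beyond the intercept is one spline basis function that the regression assigns a coefficient to.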

Splines from patsy documentation.

Using binary regression

We have already modelled this:

rng = np.random.default_rng(seed = 313)
x = np.linspace(0, 3, 100)
trues = np.sin(2 * np.pi * x) * x * np.exp(x)
y = trues + rng.normal(0, 3, 100)

But how about this?

z = 1 * (trues > 0)

Plotting splines

Define a fitter function for the desired degrees of freedom:

def fit_model(df):
  formula = f"z ~ bs(x, degree = 3, df = {df})"
  return smf.logit(formula, data = pd.DataFrame({"x": x, "z": z})).fit()
print(fit_model(5).summary())
Optimization terminated successfully.
         Current function value: 0.458864
         Iterations 10
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      z   No. Observations:                  100
Model:                          Logit   Df Residuals:                       94
Method:                           MLE   Df Model:                            5
Date:                Fri, 18 Nov 2022   Pseudo R-squ.:                  0.3372
Time:                        07:38:42   Log-Likelihood:                -45.886
converged:                       True   LL-Null:                       -69.235
Covariance Type:            nonrobust   LLR p-value:                 6.551e-09
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                    7.5252      3.288      2.289      0.022       1.081      13.969
bs(x, degree=3, df=5)[0]   -12.2394      4.993     -2.451      0.014     -22.026      -2.453
bs(x, degree=3, df=5)[1]    -2.2668      2.716     -0.835      0.404      -7.590       3.057
bs(x, degree=3, df=5)[2]   -15.3265      4.839     -3.168      0.002     -24.810      -5.843
bs(x, degree=3, df=5)[3]     8.0071      4.657      1.719      0.086      -1.121      17.135
bs(x, degree=3, df=5)[4]   -65.6163     20.183     -3.251      0.001    -105.174     -26.058
============================================================================================

Possibly complete quasi-separation: A fraction 0.12 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

On the regression data

def fit_model(df):
  formula = f"y ~ bs(x, degree = 3, df = {df})"
  return smf.ols(formula, data = pd.DataFrame({"x": x, "y": y})).fit()
[(df, fit_model(df).aic) for df in range(3, 16)]
[(3, 769.919848597824),
 (4, 768.489935931196),
 (5, 770.497611464271),
 (6, 733.230555729813),
 (7, 585.9675275398621),
 (8, 573.4227183375275),
 (9, 575.0085899189091),
 (10, 543.2839496203144),
 (11, 519.8409155393351),
 (12, 517.3087684885444),
 (13, 513.5928894005049),
 (14, 515.7660420431214),
 (15, 517.8746224632054)]

The delta method

The problem

  • Let \(x_1, x_2, x_3, \ldots, x_n\) be iid Bernoulli with some unknown \(p\).

  • We can estimate \(p\) using \(\overline{x}\). Its variance is \(p(1-p)/n\).

  • But what happens if we need to estimate the odds \(p/(1-p)\) instead of \(p\)? What’s the variance then?

The delta method

Let \(g\) be a differentiable function and

\[ \sqrt{n}(\hat{\theta}_n -\theta)\stackrel{d}{\to}N(0,\sigma^2) \]

where \(\sigma^2\) is the asymptotic variance. Then

\[ \sqrt{n}(g(\hat{\theta}_n) -g(\theta))\stackrel{d}{\to}N(0,[g'(\theta)]^2\sigma^2) \]

Solving the problem

  • In our case \(g(p)=p/(1-p)\).

  • Its derivative is \(1/(1-p)^2\). (Found using the quotient rule.)

  • Recall that the asymptotic variance of \(\overline{x}\) (i.e., variance of \(\sqrt{n}(\overline{x}-p)\) ) is \(\sigma^2 = p(1-p)\).

  • Thus \(g'(p)^2\sigma^2=p(1-p)/(1-p)^4=p/(1-p)^3\).

  • Hence the variance of the estimated odds \(\overline{x}/(1-\overline{x})\) is approximately \(p/(1-p)^3/n\).
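We can sanity-check this variance with a quick simulation; the choices of \(p\), \(n\), and the number of replications below are arbitrary.

```python
import numpy as np

# Monte Carlo check of the delta-method variance p/(1-p)^3/n for the odds.
rng = np.random.default_rng(seed=313)
p, n = 0.3, 1000

# Each replication: estimate p by the mean of n Bernoulli trials,
# then transform to the odds.
xbar = rng.binomial(n, p, size=100_000) / n
odds = xbar / (1 - xbar)

print(odds.var())            # empirical variance of the estimated odds
print(p / (1 - p) ** 3 / n)  # delta-method approximation
```

The two numbers should agree to within Monte Carlo error, since the delta-method approximation is very accurate at \(n = 1000\).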

Applications

  • Suppose \(X\) has variance \(\sigma^2\) and unknown mean \(\mu\). How can we estimate \(\mu^2\) and what is its asymptotic variance?

    • Answer: We use \(\overline{x}^2\). The derivative of \(g(x)=x^2\) is \(g'(x)=2x\), hence \(g'(x)^2 = 4x^2\). It follows from the delta method that the variance is \(4\mu^2\sigma^2/n\).
  • Suppose \(\hat{\beta}_1\) is a regression coefficient from the logistic model. What is the asymptotic variance of \(\exp(\hat{\beta}_1)\)?

  • Answer: Recall that the entire parameter vector \(\hat{\beta}\) has covariance matrix equal to the inverse of the Fisher information \(I^{-1}\), hence the variance of \(\hat{\beta}_1\) is the first diagonal element of \(I^{-1}\). Now apply the delta method to \(g(x)=\exp(x)\), yielding \((g'(x))^2 = \exp(2x)\). Hence the asymptotic variance of \(\exp(\hat{\beta}_1)\) is \(\exp(2\beta_1)I^{-1}_{11}\).
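A sketch of this computation in code, on simulated data (the true coefficients \(\beta_0 = -0.5\), \(\beta_1 = 1\) are invented for illustration). Since cov_params() already estimates the covariance of \(\hat{\beta}\) itself, the delta-method standard error of the odds ratio is just \(\exp(\hat{\beta}_1)\) times the standard error of \(\hat{\beta}_1\).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate logistic data with true beta_0 = -0.5, beta_1 = 1 (illustrative).
rng = np.random.default_rng(seed=313)
x = rng.normal(size=1000)
prob = 1 / (1 + np.exp(-(-0.5 + 1.0 * x)))
y = rng.binomial(1, prob)
model = smf.logit("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit(disp=0)

# Delta method with g(x) = exp(x): se(exp(beta_1)) = exp(beta_1) * se(beta_1).
beta1 = model.params["x"]
se_beta1 = np.sqrt(model.cov_params().loc["x", "x"])
se_odds_ratio = np.exp(beta1) * se_beta1
print(np.exp(beta1), se_odds_ratio)
```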

Gradients

  • Recall that the gradient of a function \(g:\mathbb{R}^k\to\mathbb{R}\) is the column vector \[\nabla g(\theta)=\left[\begin{array}{c} \frac{\partial g}{\partial\theta_{1}}\\ \frac{\partial g}{\partial\theta_{2}}\\ \vdots\\ \frac{\partial g}{\partial\theta_{k}}\end{array}\right]\]

  • Example. Consider the function \[g(\theta)=\theta_{1}^{2}+\theta_{2}^{2}+\theta_{3}^{2}\]

  • The gradient is \[\nabla g(\theta)=\left[\begin{array}{c} \frac{\partial g}{\partial\theta_{1}}\\ \frac{\partial g}{\partial\theta_{2}}\\ \frac{\partial g}{\partial\theta_{3}} \end{array}\right]=2\left[\begin{array}{c} \theta_{1}\\ \theta_{2}\\ \theta_{3} \end{array}\right]\]

Delta method with vectors

Assume that \(\theta\) is a vector, that \(g\) is differentiable, and that \[\sqrt{n}(\hat{\theta}-\theta)\stackrel{d}{\to}N(0,\Sigma).\] The matrix \(\Sigma\) is the asymptotic covariance matrix. Then

\[\sqrt{n}(g(\hat{\theta})-g(\theta))\stackrel{d}{\to}N(0,\nabla g(\theta)^{T}\Sigma\nabla g(\theta))\]
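A quick simulation illustrating the multivariate version, with \(g(\theta)=\theta_1^2+\theta_2^2\); the choices of \(\theta\), \(\Sigma\), and \(n\) are arbitrary illustrative values.

```python
import numpy as np

# Check n * var(g(theta_hat)) against nabla g^T Sigma nabla g for
# g(theta) = theta_1^2 + theta_2^2.
rng = np.random.default_rng(seed=313)
theta = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
n = 500

# For normal data the mean is exactly N(theta, Sigma / n), so we can
# draw the estimators directly instead of averaging raw observations.
theta_hat = rng.multivariate_normal(theta, Sigma / n, size=100_000)
g_hat = (theta_hat ** 2).sum(axis=1)

grad = 2 * theta              # gradient of g evaluated at theta
print(n * g_hat.var())        # simulated asymptotic variance
print(grad @ Sigma @ grad)    # delta-method value
```

The two printed numbers should agree to within Monte Carlo error.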

Applications

  • Question: Remember what the asymptotic covariance of the regression coefficients in a binary regression model is?
    • Answer: It’s the inverse Fisher information: \(I^{-1}(\beta)\)! You can find the estimated covariance matrix using cov_params().
  • Question: How would you find the asymptotic variance of \(\hat{\theta} = \sum_{i=1}^p \hat{\beta}_i^2\), where \(\hat{\beta}_i\) are the estimated coefficients from a logistic regression model?
    • Answer: Use the delta method together with the inverse Fisher information. The gradient of \(g(\beta)=\sum_{i=1}^p \beta_i^2\) is \(2\beta\), and the asymptotic covariance matrix \(\Sigma\) can be estimated as n*model.cov_params().
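A sketch of this computation on simulated data (the data-generating coefficients \(0.5\) and \(-1.0\) are invented for illustration). Since cov_params() estimates \(\Sigma/n\), the quadratic form below directly estimates the variance of \(\hat{\theta}\) without an extra factor of \(n\).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a logistic model with true coefficients (0.5, -1.0), no intercept.
rng = np.random.default_rng(seed=313)
x1 = rng.normal(size=2000)
x2 = rng.normal(size=2000)
prob = 1 / (1 + np.exp(-(0.5 * x1 - 1.0 * x2)))
y = rng.binomial(1, prob)
data = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
model = smf.logit("y ~ x1 + x2 - 1", data=data).fit(disp=0)

# g(beta) = sum(beta_i^2) has gradient 2 * beta; cov_params() already
# estimates Sigma / n, so this is the variance of theta_hat itself.
beta = model.params.to_numpy()
grad = 2 * beta
var_theta = grad @ model.cov_params().to_numpy() @ grad
print((beta ** 2).sum(), var_theta)
```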

Inference for the probability in binary regression

  • Recall that \(E(Y\mid X) = F(\beta^T X)\) where \(F\) is the inverse link function (with derivative \(f\)). What is the asymptotic variance of \(F(\hat{\beta}^TX)\)?
  • Putting \(g(\beta)=F(\beta^{T}X)\) we get \[ \nabla g(\theta)=f(\beta^{T}X)\left[\begin{array}{c} X_{1}\\ X_{2}\\ \vdots\\ X_{k} \end{array}\right]\]
  • Hence, if \(\Sigma\) is the asymptotic covariance matrix of \(\hat{\beta}\), the asymptotic variance of \(F(\beta^{T}X)\) is \[\tau = f^2(\beta^{T}X)X^{T}\Sigma X.\]
  • Again, construct confidence intervals using \(F(\hat{\beta}^{T}X) \pm \tau^{1/2} n^{-1/2} \Phi^{-1}(0.975)\)

Admission data

import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
admission = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admission.head()
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4

Admission data: Summary

model = smf.logit("admit ~ gre + gpa + C(rank)", data = admission).fit()
print(model.summary())
Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                          Logit   Df Residuals:                      394
Method:                           MLE   Df Model:                            5
Date:                Fri, 18 Nov 2022   Pseudo R-squ.:                 0.08292
Time:                        07:38:44   Log-Likelihood:                -229.26
converged:                       True   LL-Null:                       -249.99
Covariance Type:            nonrobust   LLR p-value:                 7.578e-08
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -3.9900      1.140     -3.500      0.000      -6.224      -1.756
C(rank)[T.2]    -0.6754      0.316     -2.134      0.033      -1.296      -0.055
C(rank)[T.3]    -1.3402      0.345     -3.881      0.000      -2.017      -0.663
C(rank)[T.4]    -1.5515      0.418     -3.713      0.000      -2.370      -0.733
gre              0.0023      0.001      2.070      0.038       0.000       0.004
gpa              0.8040      0.332      2.423      0.015       0.154       1.454
================================================================================

Admission data: cov_params()

  • We want to estimate the probability that a student with GRE equal to \(700\) and GPA equal to \(3.00\) is admitted to a rank 4 college – and we want a confidence interval for this probability!
  • The asymptotic covariance matrix is available using cov_params().
print(model.cov_params())
              Intercept  C(rank)[T.2]  C(rank)[T.3]  C(rank)[T.4]       gre  \
Intercept      1.299488     -0.084476     -0.048644     -0.089431 -0.000301   
C(rank)[T.2]  -0.084476      0.100166      0.069566      0.070127 -0.000002   
C(rank)[T.3]  -0.048644      0.069566      0.119237      0.069742  0.000019   
C(rank)[T.4]  -0.089431      0.070127      0.069742      0.174583  0.000012   
gre           -0.000301     -0.000002      0.000019      0.000012  0.000001   
gpa           -0.303660      0.004521     -0.009469      0.003568 -0.000124   

                   gpa  
Intercept    -0.303660  
C(rank)[T.2]  0.004521  
C(rank)[T.3] -0.009469  
C(rank)[T.4]  0.003568  
gre          -0.000124  
gpa           0.110104  

Admission data

  • Since \(F(x) = 1/(1+\exp(-x))\) in the logistic model, its derivative is \(f(x)=\exp(-x)/(1+\exp(-x))^2\).
import numpy as np
f = lambda x: np.exp(-x)/(1 + np.exp(-x)) ** 2
x = np.array([1, 0, 0, 1, 700, 3])
tau = x @ model.cov_params() @ x * f(x @ model.params) ** 2
print(tau)
0.0030549144448521514
  • And the 95% confidence interval, using \(\Phi^{-1}(0.975) \approx 1.96\), is:
F = lambda x: 1/(1+np.exp(-x))
estimate = F(x @ model.params)
CI = [estimate - np.sqrt(tau) * 1.96, estimate + np.sqrt(tau) * 1.96]
print(CI)
[0.06758919117218386, 0.2842526106166671]